Name: Hicham Mallah
Email: hicham.mallah@gmail.com

Getting to know the data

In this project we’ll be exploring some chemical ingredients of red wine, and how does these affect the quality of our wine.

First of all let’s check out what are these ingredients, by looking at the column names and their types.

names(wine)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
str(wine)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Plot and explore

Counts by qualities

Let’s now look at how many wines of each quality do we have, we can notice that the most quality we have in our data is quality 5 with 681 wines, and the second is quality 6 with 638.
Then we have 199 wines of quality 7, 53 of 4, 10 of 3 and 18 of 8.

So basically most of our wines lie between qality 5 and 6.

ggplot(aes(x=quality), data=wine)+
  geom_histogram(binwidth=.1, fill='blue')+
  scale_x_continuous(breaks=c(3:8))+
  scale_y_continuous(breaks=seq(0, 700, 50))+
  ggtitle('Wine Quality Counts')

GGPairs plot to examine our variables against each others

I would like to start by exploring what are the relationships between the variables in my data. Best thing to do that is to explore our data using ggpairs function in the GGally library.

ggpairs(wine,  binwidth=1)+
  theme(panel.border = 
          element_rect(linetype = "dashed", colour = "blue", fill = NA))


Looking at the output generated by ggpairs function, we can notice that the wine quality is somewhat positively related to the amount of alcohol, that’s why looking at the scatterplot for variables alcohol and quality we can see the points moving a bit to the right (higher alcohol percentage) as we move up towards a better quality.

Another thing we notice is that quality is somewhat negatively related to volatile.acidity that’s why the points move a bit left whenever the quality goes higher.

Other than these 2 variables nothing seems to be really affecting quality in our variables, or at least not as much as alcohol and volatile acidity.

We’ll have to explore more…

Before continuing, I want to create a new column in our data set, which is the same as the quality column but as a factor.

wine$quality.factor=factor(wine$quality)


Looking into alcohol counts
ggplot(aes(x=round(alcohol, digits=1)), data=wine)+
  geom_bar(binwidth=.1, fill='blue')+
  scale_x_continuous(limits=c(9, 13.7), breaks=seq(9, 13.7, .1))+
  scale_y_continuous(limits=c(0, 140), breaks=seq(0, 140, 10))+
  xlab('Alcohol round to 1 digit after the decimal')+
  ggtitle('Wine by Alcohol Counts')
## Warning: position_stack requires constant width: output may be incorrect

I rounded the alcohol percentage to 1 digit after the decimal, because we have some wines with many digits after the decimal.
We can see that most of our alcohols are in the range of 9.2% and 11.2% of alcohol. The highest count is for alcohol 9.4% with 139 wines.

Looking into pH

An interesting thing that I would like to do, is look at how our wines are separated by pH. Let’s first create groups of pH, in intervals of .5 and check how much each group will contain.

wine$pH.groups=cut(wine$pH, breaks=c(2.5, 3, 3.5, 4, 4.5))
ggplot(aes(x=pH.groups), data=wine)+
  geom_histogram(fill='blue', width=1)+
  scale_y_continuous(breaks=seq(0, 1400, 50))
Looking at the plot, we can easily notice that most of our wines have pH between 3 and 3.5. With less than 50 wines between 2.5 and 3, about 1390 lie between 3 and 3.5, 160 in the 3.5-5 group, 1 or 2 in the 4-4.5. Most of our wines are well acidic.

Quality vs. alcohol colored by volatile acidity

We noticed before that the quality is mostly affected by the amount of alcohol in the wine, and volatile acidity. Let’s explore these 2 variables in a plot.

ggplot(aes(x=alcohol, y=quality, color=volatile.acidity),
       data=wine)+
  geom_point(size=4, position="jitter")+
  scale_color_gradient(low="#1d0000", name="Volatile Acidity", high="#ff6161")+
  ggtitle('Quality vs. Alcohol colored by Volatile Acidity')
Looking at the plot, this somehow proves what we noticed before. As we can see, quality gets higher as the points move a bit right towards higher alcohol percentage, the other thing we should notice, is that the higher the quality goes the less clear red points that we see, which means that volatile acidity is getting lower on higher qualities.

Some interesting info

Checking ggpairs output again, I noticed 2 things, fixed acidity is mostly affected by citric acid, that’s logical, what we can also see is that citric acid is affected by sulphates, and what intrigued me more, is how much density is affected by fixed acidity, so let’s take a closer look into these.
Fixed acidity vs. citric acid colored by volatile acidity sized by pH
ggplot(aes(x=citric.acid, y=fixed.acidity, 
           color=volatile.acidity, size=pH), data=wine)+
  geom_jitter()+
  scale_y_continuous(limits=c(5,12), breaks=seq(5,12,1))+
  scale_x_continuous(limits=c(0, 0.75))+
  scale_color_gradient(low="#1d0000", name="Volatile Acidity", high="#ff6161")+
  scale_size(name="pH", range = c(2, 8))+
  ggtitle('Fixed Acidity vs. Citric Acid 
          Colored by Volatile Acidity Sized by pH')
## Warning: Removed 134 rows containing missing values (geom_point).
In this plot we can see how fixed acidity increases with citric acid, we can also notice that the points get darker and smaller citric acid and fixed acidity increase. This means that our pH is getting lower, more acidic, as these variables increase, and that our volatile acidity is decreasing.
The 2 warnings we’ve got is because of the limiting the x-axis to be between 0 and 0.75 and y-axis between 5 and 12. I did that to be able to zoom in on the area that is most interesting of the plot.

Density vs. fixed acidity colored by citric acid
ggplot(aes(x=fixed.acidity, y=density, color=citric.acid), data=wine)+
  geom_jitter(size=3)+
  geom_smooth(method='loess')+
  scale_color_gradient(low="#1d0000", name="Citric Acid", high="#ff6161")+
  ggtitle('Density vs. Fixed Acidity')
As we notice, from the points and the smooth line, density is visibly positively correlated, and as we knew from the plot before, fixed acidity is affected mostly by citric acid. Here we can see all of this, we can see that density is getting higher as fixed acidity increases and we can see that the points are moving towards clear red, higher citric acid, as these 2 variables increase.

How is density affected by alcohol

On the other hand, again looking at the ggpairs output, we can notice that desnity is negatively related to alcohol, so the higher the alcohol the lower the density. Let’s look into that.

ggplot(aes(x=alcohol, y=density), data=wine)+
  geom_line(stat='summary', fun.y=median, color='blue')+
  geom_smooth(method="loess")+
  ggtitle('Density vs. Alcohol')
As we said before, we can notice in this plot how the density is decreasing as alcohol percentage is going up. By using, “fun.y=median”, we’re telling R to plot the median of each alcohol level, so for each alcohol level, R groups them and extracts their median density and plots it. Using this way we might lose some points, but we remove over plotting to be able to read our plot.
Comparing how density is affected by alcohol and fixed acidity
p1=ggplot(aes(x=alcohol, y=density), data=wine)+
  geom_line(stat='summary', fun.y=median, color='blue')+
  ggtitle('Density vs. Alcohol (blue) - Density vs. Fixed Acidity (red)')
p2=ggplot(aes(x=fixed.acidity, y=density), data=wine)+
  geom_line(color='red', stat='summary', fun.y=median)

grid.arrange(p1, p2, ncol=1)


Here we can clearly see how density is mostly going up as fixed acidity increases (Red line), and how density is going down as alcohol is increasing (Blue line). By using, “fun.y=median”, we’re telling R to plot the median of each x variable (alchol and fixed.acidity), so for each x variable value, R groups the densities extracts their median and plots it.

Residual sugar effect on density
ggplot(aes(x=residual.sugar, y=density), data=subset(wine, residual.sugar<=4))+
  geom_point()+
  scale_x_continuous(breaks=seq(0, 4, .1))+
  geom_smooth(method='loess')+
  ggtitle('Density vs. Residual Sugar')


Density is also positively affected by residual sugars. I took a subset of our data where residual.sugar is less than or equal to 4, that’s just because that’s where all the action is going on.

Free sulfur dioxided vs. total sulfur dioxide

One last thing I would like to look at is free sulfur dioxide against total sulfur dioxide. According to this article sulfur dioxide is an important ingredient that existed since the beggining of time of wine. So let’s take a look into that.

ggplot(aes(x=free.sulfur.dioxide, y=total.sulfur.dioxide, 
           color=quality.factor), 
       data=subset(wine, free.sulfur.dioxide<=55 & total.sulfur.dioxide<=150))+
  geom_point()+
  scale_color_manual(values=rev(brewer.pal(9, "Blues")), name="Quality")+
  scale_size_discrete(name="Quality")+
  ggtitle('Total Sulfur Dioxide vs. Free Sulfur Dioxide, 
          Colored and Sized by Quality')
Looking at the plot above, for sure we see that total sulfur dioxide is increasing as free sulfur dioxide increases, that’s logical. But I don’t really think that sulfur dioxide is linked to quality, it’s true that it actually acts as a wine conservative, but it doesn’t have direct effect on wine quality. Maybe with age it does…

Back to real work

Enough wondering around, let’s get back to quality exploration.

Volatile acidity vs. citric acid, colored and sized by quality

I was wondering why are volatile acidity and citric acid negative correlated, so I did some research online to understand. I ended up understading that a wine with a low acidity is considered as boring, and one is very high acidity will lead to a sour wine. That’s why wine makers have to know how to balance the acidity of the wine by balancing these 2 acids, volatile and citric.
Article link
Here’s a plot to show us the relation between citric and volatile acidity:

ggplot(aes(x=volatile.acidity, y=citric.acid, 
           color=quality.factor),
       data=subset(wine, volatile.acidity>=.2 & volatile.acidity<=1.2))+
  geom_point(stat='summary', fun.y=median)+
  scale_color_manual(values=rev(brewer.pal(9, "Blues")), name="Quality")+
  scale_size_discrete(name="Quality")+
  scale_x_continuous(breaks=seq(0.2, 1.2, .2))+
  ggtitle('Volatile Acidity vs. Citric Acid, colored and sized by Quality')
In the plot we can notice that the points are mostly following a descending curve, most of the wines are on the curve, what we can also notice that when some points are going out of the curve, over the curve, their quality is going gradually lower.

Revisit pH.groups plot, color by quality

Let’s see if we can prove this, by relooking at the plot we did above for our pH.groups but this time, let’s add color by quality to it, let’s see where do most higher quality wines fall.

ggplot(aes(x=pH, fill=quality.factor, color=quality.factor), data=wine)+
  geom_density(aes(alpha=.7))+
  scale_x_continuous(breaks=seq(2.7, 4, .1))+
  scale_y_continuous(breaks=seq(0, 1400, 50))

That’s interesting, well most of our wines, fall in the pH group 3, 3.5 (mid-acidic), but what’s really interesting is that this same group contains the most wines of quality 7 and 8. What we can also notice is that, wine 3, and 4 has the highest pH while quality 7 and 8 have the lowest respectively.
Median quality vs. alcohol

Let’s look in another way on how alcohol affects the quality of wines.

wine_by_alcohol=wine %>% group_by(alcohol) %>% 
  summarize(median_quality=median(quality))%>%arrange(median_quality)

ggplot(aes(x=alcohol, y=median_quality), data=wine_by_alcohol)+
  geom_point()+
  geom_smooth(method='loess')+
  ggtitle('Median Quality vs. Alcohol')

Looking at the above graph, and the median by alcohol, we can notice that quality goes up with alcohol, but to a certain extent. Basically what we can notice is that quality starts to go up with 10% alcohol and stays increasing till about alcohol percentage of 13.2 then starts going back down. I think this is an interesting trend in the data.

Median quality vs. volatile acidity

Now let’s take a closer look at volatile acidity alone, see what we can deduce.

ggplot(aes(x=volatile.acidity , y=quality), data=wine)+
  geom_point(stat="summary", fun.y=median)+
  geom_smooth(method='loess')+
  ggtitle('Median Quality vs. Volatile acidity')
The above plot, and median quality by volatile acidity, shows us what we already thought. Quality is negatively correlated to volatile acidity. The median quality is going down from about 6.3 till about 3.4 as volatile acidity goes up from about 0.1 till 1.6

Trying to predict

Finally I would like to try the data we have to predict the wine quality, we’ll try two models, and see what we get. First of all I’ll change pH into a factor, because I want my model to consider it as a factor and not a float.

Linear Regression Model

Our first model will be the linear regression model, then we’ll plot our predictions against the actual quality of each wine:

Let’s first define 3 functions that will help us test the accuracy of our model.
First function “prediction_testing_and_group” will take a dataframe as a parameter, in this dataframe it will add a column quality.prediction.equal which will be True if our prediction is correct, and false if not. Then this function will create a new dataframe which is created by grouping the provided dataframe by quality, quality.prediction, and quality.prediction.equal, and an addition column named counts which is the count of items in this group.
Second function will plot False and True predcitions and size them by their counts.
Third function will calculate accuracy.

prediction_test_and_group=function(df){
  df$quality.prediction.equal=
    ifelse(df$quality==
            df$quality.prediction, TRUE, FALSE)
  df$quality.prediction.equal=
    factor(df$quality.prediction.equal)
  
  return_df=df%>%group_by(quality.factor, quality.prediction, 
                          quality.prediction.equal)%>%summarize(counts=n())
  
}

plot_predictions=function(df, title){
  ggplot(aes(x=quality.factor, y=quality.prediction, 
           color=quality.prediction.equal, size=counts), 
       data=df)+
  geom_point()+
  ylab('predicted quality')+
  ggtitle(title)
}

calculate_accuracy=function(df, total_count){
  (sum(
  subset(df, quality.prediction.equal==TRUE)$counts)
 /total_count)*100
}


Now the first model:

wine$pH.factor=factor(wine$pH)
m1=lm(quality ~ alcohol, data=wine)
m2=update(m1, ~ . + volatile.acidity)
m3=update(m2, ~ . + sulphates)
m4=update(m3, ~ . + citric.acid)
m5=update(m4, ~ . + total.sulfur.dioxide)
m6=update(m5, ~ . + density)
m7=update(m6, ~ . + chlorides)
m8=update(m7, ~ . + fixed.acidity)
m9=update(m8, ~ . + pH)
mtable(m1, m2, m3, m4, m5, m7, m8, m9)
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid, data = wine)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide, data = wine)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + density + chlorides, 
##     data = wine)
## m8: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + density + chlorides + 
##     fixed.acidity, data = wine)
## m9: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + density + chlorides + 
##     fixed.acidity + pH, data = wine)
## 
## =============================================================================================================
##                           m1         m2         m3         m4         m5         m7         m8         m9    
## -------------------------------------------------------------------------------------------------------------
## (Intercept)            1.875***   3.095***   2.611***   2.646***   2.843***   -0.953     28.165      8.571   
##                       (0.175)    (0.184)    (0.196)    (0.201)    (0.205)    (11.990)   (15.083)   (17.045)  
## alcohol                0.361***   0.314***   0.309***   0.309***   0.295***    0.280***   0.268***   0.293***
##                       (0.017)    (0.016)    (0.016)    (0.016)    (0.016)     (0.020)    (0.021)    (0.023)  
## volatile.acidity                 -1.384***  -1.221***  -1.265***  -1.222***   -1.114***  -1.137***  -1.131***
##                                  (0.095)    (0.097)    (0.113)    (0.112)     (0.120)    (0.120)    (0.119)  
## sulphates                                    0.679***   0.696***   0.721***    0.903***   0.916***   0.898***
##                                             (0.101)    (0.103)    (0.103)     (0.112)    (0.112)    (0.112)  
## citric.acid                                            -0.079     -0.043       0.044     -0.198     -0.224   
##                                                        (0.104)    (0.104)     (0.124)    (0.145)    (0.145)  
## total.sulfur.dioxide                                              -0.002***   -0.002***  -0.002***  -0.002***
##                                                                   (0.001)     (0.001)    (0.001)    (0.001)  
## density                                                                        3.923    -25.583     -4.343   
##                                                                              (11.944)   (15.122)   (17.404)  
## chlorides                                                                     -1.747***  -1.584***  -1.827***
##                                                                               (0.406)    (0.408)    (0.419)  
## fixed.acidity                                                                             0.055**    0.016   
##                                                                                          (0.017)    (0.023)  
## pH                                                                                                  -0.442*  
##                                                                                                     (0.180)  
## -------------------------------------------------------------------------------------------------------------
## R-squared                 0.227      0.317      0.336      0.336      0.344      0.352      0.356      0.358 
## adj. R-squared            0.226      0.316      0.335      0.334      0.342      0.349      0.353      0.355 
## sigma                     0.710      0.668      0.659      0.659      0.655      0.652      0.650      0.649 
## F                       468.267    370.379    268.912    201.777    166.962    123.298    109.751     98.534 
## p                         0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000 
## Log-likelihood        -1721.057  -1621.814  -1599.384  -1599.093  -1589.749  -1580.138  -1575.112  -1572.088 
## Deviance                805.870    711.796    692.105    691.852    683.814    675.643    671.409    668.874 
## AIC                    3448.114   3251.628   3208.768   3210.186   3193.499   3178.276   3170.224   3166.177 
## BIC                    3464.245   3273.136   3235.654   3242.448   3231.138   3226.670   3223.995   3225.325 
## N                      1599       1599       1599       1599       1599       1599       1599       1599     
## =============================================================================================================
wine$quality.prediction=round(fitted(m9))
wine_by_quality_and_pred=prediction_test_and_group(wine)
  
plot_predictions(wine_by_quality_and_pred, 
                 "Predicted Quality (Linear Regression Model) vs. 
                 Actual Quality")

calculate_accuracy(wine_by_quality_and_pred, length(wine$quality.prediction))
## [1] 59.28705
As we can see above, we created a linear regression model, then created a new variable in our data frame, quality.prediction, and quality.prediction.equal which is True if the predicted quality is equal to the actual quality, and false if not. We then create a new dataframe, by grouping our data on quality, quality.prediction and quality.prediction.equal, then we summarize to get the count of data in each of these groups. Then we plot these.
Looking at the plot we can notice the blue points, these are the correct predictions. We can notice that on actual quality 3 red points on 4, 5, and 6 these points are about the same size, this means that all of our predictions for quality 3 were wrong. For quality 4 we have a small blue point which is the small number of wines that we correctly predicted to be of level 4, then we see 2 red points on 5 and 6 these are the wines of actual quality 4, that we predicted as being 5 or 6. For qualities 5 and 6 things get better. We can notice that the blue points are much bigger than the other red points which means that we did well for these qualities. We didn’t do that good for quality 7 as we can see the point we have on predicted quality 7 is smaller than the point at 6. And at last we didn’t get any correct prediction for quality 8.

We have an R2 of 0.358. And calculating the accuracy we have by summing up the number of correct predictions then dividing by the total count of our data 1599, then multiplying by 100 we see that we have and accuracy level of 59.29%.

SVM - Support Vector Machine

Let’s try another model, and see how well can we do. This model will be the SVM model (Support Vector Machine). for this model, I’ll change quality to factor too, I want my model to be a classification model because actually quality isn’t an integer but a classification. Let’s see what we get:

#Let's first drop the columns that we don't need anymore.
wine$pH.groups=NULL
wine$quality.prediction=NULL
wine$quality.prediction.equal=NULL
wine$pH.factor=NULL
model = svm(quality.factor~., data = wine)
wine$quality.prediction=fitted(model)
wine_by_quality_and_pred=prediction_test_and_group(wine)
summary(model)
## 
## Call:
## svm(formula = quality.factor ~ ., data = wine)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.08333333 
## 
## Number of Support Vectors:  482
## 
##  ( 169 162 87 36 18 10 )
## 
## 
## Number of Classes:  6 
## 
## Levels: 
##  3 4 5 6 7 8
plot_predictions(wine_by_quality_and_pred, 
                 "Predicted Quality (SVM) vs. Actual Quality")

calculate_accuracy(wine_by_quality_and_pred, length(wine$quality.prediction))
## [1] 99.49969

Doing the same thing for our SVM as we did for our regression model. We can see a VERY big difference in our predictions. 99.5% accuracy! Looking at the graph we can barely see a red point indicating few wrong predictions.

Let’s try playing a bit with our features with respect to the stuff we studied above, let’s see if we can increase our model’s score. Here we go:

feature_cols=c('alcohol', 'volatile.acidity', 'citric.acid', 
               'total.sulfur.dioxide', 'density', 'quality.factor', 'quality')
wine_for_svm=wine[,(names(wine) %in% feature_cols)]
model = svm(quality.factor~., data = wine_for_svm)
wine_for_svm$quality.prediction=fitted(model)
wine_by_quality_and_pred=prediction_test_and_group(wine_for_svm)
plot_predictions(wine_by_quality_and_pred, 
                 "Predicted Quality (SVM) vs. Actual Quality")


No more “False” predictions, that’s a 100% accuracy, as good as it can get!

calculate_accuracy(wine_by_quality_and_pred, length(wine$quality.prediction))
## [1] 100

Changing the features we use by removing some that aren’t really correlated to our quality, we increased our models accuracy to the maximum. Now let’s not be over excited, let’s test our model.

Testing SVM on another data list

Since most of our data lies between qualities 5 and 6 as we’ve seen in the first section of this report, we have 1319 in these 2 qualities (82.5% of our data). We can’t really say that we can totally rely on our 100% accuracy, so I downloaded another csv file to run my model against it and see what we get.
file:

https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/Training50_winedata.csv

I manually deleted the last column, rt.sulfur.dioxide, since we don’t have it in our original data.
Here we go:

test_data=read.csv('training_wine.csv')
test_data$X=NULL
dim(test_data)
## [1] 2037   12
str(test_data)
## 'data.frame':    2037 obs. of  12 variables:
##  $ fixed.acidity       : num  7.5 6.3 7 7.4 8.1 6.6 7.3 6.9 8.5 7.2 ...
##  $ volatile.acidity    : num  0.33 0.27 0.3 0.38 0.12 0.2 0.26 0.32 0.18 0.27 ...
##  $ citric.acid         : num  0.32 0.29 0.51 0.27 0.38 0.38 0.36 0.17 0.3 0.28 ...
##  $ residual.sugar      : num  11.1 12.2 13.6 7.5 0.9 7.9 5.2 7.6 1.1 15.2 ...
##  $ chlorides           : num  0.036 0.044 0.05 0.041 0.034 0.052 0.04 0.042 0.028 0.046 ...
##  $ free.sulfur.dioxide : num  25 59 40 24 36 30 31 69 34 6 ...
##  $ total.sulfur.dioxide: num  119 196 168 160 86 145 141 219 95 41 ...
##  $ density             : num  0.996 0.998 0.998 0.995 0.99 ...
##  $ pH                  : num  3.15 3.14 3.07 3.17 2.8 3.32 3.16 3.13 2.83 3.17 ...
##  $ sulphates           : num  0.34 0.4 0.52 0.43 0.55 0.56 0.59 0.4 0.36 0.39 ...
##  $ alcohol             : num  10.5 8.8 9.6 10 12 11 11 8.9 10 10.9 ...
##  $ quality             : int  6 6 7 5 6 7 6 5 4 6 ...

That’s a bit bigger data frame, than our actual data, but we have the same fields list in it as we can see above. Let’s go, let’s first take out all the fields that aren’t in our model, and create the ones that are, but not in the orginal data. Then we can test.

Let’s just take a quick look at the quality distribution of the dataset, and then go on with our testing.

ggplot(aes(x=quality), data=test_data)+
  geom_histogram(binwidth=.1, fill='blue')+
  scale_x_continuous(breaks=c(3:9))+
  scale_y_continuous(breaks=seq(0, 950, 50))+
  ggtitle('Test Data Wine Quality Counts')
## Warning: position_stack requires constant width: output may be incorrect

Well there isn’t a big difference in the distribution of this dataset, We have a bit less 3, 4 and 5 qualities. It has about 1.5 more 6 quality, about double quality 7, a bit more in 8 quality and a small number of 9’s.

test_data$quality.factor=factor(test_data$quality)
test_data=test_data[,(names(test_data) %in% feature_cols)]

test_data$quality.prediction=predict(model, test_data)
test_data_by_quality_and_pred=prediction_test_and_group(test_data)
plot_predictions(test_data_by_quality_and_pred, 
                 "Predicted Quality (SVM) vs. Actual Quality")


Interesting! Looking at the plot we can also barely see the red points, we see more than the plot before, but it is still not a lot. Let’s see how much is our accuracy on this test data.

calculate_accuracy(test_data_by_quality_and_pred, 
                   length(test_data$quality.prediction))
## [1] 89.34708
Here we go, a nice 89.35% accuracy, that’s lower than the 100% we got on our actual data, but that’s still high. Well I really didn’t expect that!

Final Plots

There are 3 plots I would like to revisit, apply some tweaking to them to make them look better, I am revisiting these 3 plots because I think they are the 3 most important ones in our study.
Plot 1: Median quality vs. alcohol
ggplot(aes(x=quality.factor, y=alcohol, color=quality.factor), data=wine)+
  geom_boxplot()+
  scale_color_manual(values=rev(brewer.pal(9, "Blues")), 
                     name="Quality (classifier)")+
  ylab('Alcohol (% by volume)')+
  xlab('Quality (Classifier)')+
  ggtitle('Alcohol vs. Median Quality')
Looking at this plot, we notice that quality 3 has maximum alcohol of about 11, and its median is a bit less than 10, then quality 4 goes up to about 13, median at 10, then we notice that qualities 5 to 8 go to higher levels, and the mean is gradually incrementing between qualities 6, 7 and 8. At the same time we see that it is not a rule of thumb that higher alcohol always means better quality, looking at the plot we see that only quality 5 goes over 14% alcohol, at the same time we can see that some wines with qualities 7 and 8 go down to about 6% and 9% of alcohol respectively. So the rule is not too much alcohol, and too low alcohol, best is between 9% and 14%.

Plot 2: Median quality vs. volatile acidity
ggplot(aes(x=quality.factor, y=volatile.acidity, color=quality.factor),
       data=wine)+
  geom_boxplot()+
  scale_color_manual(values=rev(brewer.pal(9, "Blues")), 
                     name="Quality (classifier)")+
  xlab('Quality (Classifier)')+
  ylab('Volatile Acidity (g / dm^3)')+
  ggtitle('Volatile acidity vs. Quality')
Well at first look, we can see that the only quality that goes the highest is 3, so we can directly say too much volatile acidity is bad for wine. Second highest is 5, we can say that to go to high levels we have to have volatile acidity lower than 1 g/dm^3, and the best levels are between 0.25 and 0.85 g/dm^3. What is most explanatory in this plot is that as quality goes up we can see the median lines of the boxes going down gradually looking like a staircase.

Plot 3: Volatile acidity vs. citric acid
ggplot(aes(x=volatile.acidity, y=citric.acid,
           color=quality.factor),
       data=subset(wine, volatile.acidity>=.2 & volatile.acidity<=1.2))+
  geom_point(stat='summary', fun.y=median)+
  scale_x_continuous(breaks=seq(0.2, 1.2, .2))+
  scale_size_discrete(guide = guide_legend(title = 'Quality (classifier)'))+
  scale_color_manual(values=rev(brewer.pal(9, 'Blues')),
    guide = guide_legend(title = 'Quality (classifier)'))+
  ylab('Citric Acid (g / dm^3)')+
  xlab('Volatile Acidity (g / dm^3)')+
  ggtitle('Volatile Acidity vs. Citric Acid, 
          colored, sized by Quality')
As we said before, and we can see it more here, in this plot we notice the points gruadually descending mostly following some sort of a descending curve what we can also notice that when some points are going out of the curve, over the curve, their quality is going lower, especially when volatile acidity is high as soon as citric acid is higher, quality goes down significantly.

Reflections

The first thing I wan to say here is that next time I drink a good wine, I’ll be very grateful! Seems like the process of making wine is a very sensitive process!
Although we have some features that are somewhat correlated to the wine quality, I never thought we can attain an accuracy of 89.5%, especially that the qualities aren’t too much diversified, and the data set we have isn’t that big. I thought that having most our qualitiesbetween 5 and 6 won’t help being able to predict. But this shows that SVM is areally much better model for classification prediction, such as our case now, where quality isn’t an actual number but a class…

What is also interesting, and causes struggles, is that looking at the first and second plot of our final plots, we can see that there are so much levels of volatile acidity and alcohol (the features that mostly affect quality) where the wine can be of any quality. Like for example a wine with volatile acidity of 0.8 g/dm^3 and alcohol of 11% can be anywhere between quality 3 and 8! So the struggle was to find out what other stuff might help in predicting the quality, like some stuff are not much correlated to quality but if we can, let’s say, split the wine between 4 out of 10 qualities then these other features may help us decide in which quality this wine lies between these 4. For example as we can see in third plot if volatile acidity is high, but not so high to really decide what quality this wine could be, we can look at another feature which is citric acid and if it is high, we can say that this wine is between qualities 3 and 5.

An interesting improvement for the data is having wine age in it. It is nice to see how and when does age actually affect the wine quality. Maybe also where the wine was stored and stuff like that. These information would really be very interesting to discover and play with.

Another improvement is to include more regions in the data. The current data only includes Portuguese “Vinho Verde” wine, including more regions and more wine brands would help in eliminating bias, and including more diverse data.

Citation:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Additional Resources

http://www.brsquared.org/wine/Articles/SO2/SO2.htm
http://winemakersacademy.com/understanding-wine-acidity/
https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/Training50_winedata.csv